8 - Gene Signatures - How to score & interpret them

Author

Marc Elosua Bayes

Published

March 8, 2024

Introduction

Gene signatures are commonly used in routine single cell analysis. Many methods exists but they are not all created equally. In this tutorial we are going to go follow a recent benchmarking paper Badia-i-Mompel et al. (2022) and follow their guidelines on best practices when scoring gene signatures!

With this tutorial we hope to familiarize you with the concepts of gene signatures, how they are scored in single cell datasets and how to interpret the scores obtained!

Before we start here are some key concepts that will help us and frame the vignette!

  • What is a gene signature?

    A “gene signature” can be stated as a single or a group of genes in a cell having a unique pattern of gene expression that is the consequence of either changed biological process or altered pathogenic medical terms Mallik and Zhao (2018).

  • What is a cell type signature?

    A cell type signature is a gene signature representing a group of genes underlying the biological processes characteristic of a cell type.

  • How do we score them in our dataset?

    Scoring a gene signature means to obtain a value for that signature for each cell in our datasets that represents how active the gene program is in each cell. There are many ways to score gene signatures as shown in the decoupleR paper Badia-i-Mompel et al. (2022). However, they do not all perform the same and it is important to select a robust method. The suggested method after their benchmarking analysis is running a Univariate Linear Model (ULM) where the gene expression values are the response variable and the regulator weights in the gene signature are the explanatory one (don’t worry, we’ll go through this in more detail in a second). The obtained t-value from the fitted model is the activity score of that gene signature in that cell.

  • How do we interpret that score?

    Scoring gene signatures using Univariate Linear Models and using the resulting t-value as the scoring metric allows us to simultaneously interpret in a single statistic the direction of activity (either + or -) and its significance (the magnitude of the score).

  • Can we interrogate the scores obtained?

    Yes! In fact it is very important to look past the score obtained by a cell and into which are the genes driving that score. Sometime with gene signatures containing 50 genes it could be that just a few genes are contributing to the signature. If we just stopped at the score we could be mislead into thinking that all of the genes making up the signature are important when it is actually only a fraction of them. Moreover, heterogeneous gene expression between two populations can also lead to 2 cells or populations having similar scores but vastly different genes gene programs underlying them.

Libraries

Load the libraries and install the packages needed to run this notebook

if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages("Seurat")

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
    
if (!requireNamespace("decoupleR", quietly = TRUE))
    remotes::install_github('saezlab/decoupleR')

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
    BiocManager::install("ComplexHeatmap")

if (!requireNamespace("glue", quietly = TRUE))
    install.packages("glue")

if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages("colorBlindness")

if (!requireNamespace("ggpmisc", quietly = TRUE))
    install.packages("ggpmisc")

if (!requireNamespace("circlize", quietly = TRUE))
    install.packages("circlize")

library(Seurat)
library(tidyverse)
library(decoupleR)
library(ComplexHeatmap)
library(colorBlindness)
library(ggpmisc)

# Remember to set a seed so the analysis is reproducible!
set.seed(687)

Load data

We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from cellxgene portal.

# Download the data in data/ directory
# download.file(
#     url = "https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds",
#     destfile = "../data/workshop-data.rds",
#     method = "wget",
#     extra = "-r -p --random-wait")
# We can also use the CLI with the wget command below
# wget https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds

se <- readRDS("../data/workshop-data.rds")
# Remove Uncategorized
se <- se[, ! se$Celltype %in% c("Uncategorized1", "Uncategorized2")]
se$Celltype <- as.character(se$Celltype)

Generate a color palette for our cell types

# https://www.datanovia.com/en/blog/easy-way-to-expand-color-palettes-in-r/
nb.cols <- length(unique(se$Celltype))
pal <- colorRampPalette(paletteMartin)(nb.cols)
names(pal) <- sort(unique(se$Celltype))

Analysis

Convert ENSEMBL IDs to Gene Symbols

Right away we can see how ensembl ids are used in the rownames. Let’s transform them into their matched symbols to make them human-readable:

head(rownames(se))
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" "ENSG00000000938"

Convert to gene symbols

gene_df <- readr::read_csv(file = "../data/cov_flu_gene_names_table.csv")

symbol_id <- data.frame(index = rownames(se)) %>%
    left_join(gene_df, by = "index") %>%
    pull(feature_name)

# re-create seurat object
mtx <- se@assays$RNA$data
rownames(mtx) <- symbol_id
se <- CreateSeuratObject(counts = mtx, meta.data = se@meta.data)

Preprocessing

We will do a quick preprocessing of the data. 1) log-normalize, 2) identify highly variable genes, 3) scale their expression and 4) compute PCA on the scaled data.

se <- se %>%
    NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(nfeatures = 3000, verbose = FALSE) %>%
    ScaleData(verbose = FALSE) %>%
    RunPCA(verbose = FALSE)

Next we check the elbow plot to determine the number of PCs to use for the downstream analysis and then compute UMAP, K-nearest neighbor graph (KNN graph) and run Louvain clustering on it.

# Look at elbow plot to assess the number of PCs to use
ElbowPlot(se, ndims = 50)

We can see a clear elbow at 10 PCs, we’re going to extend it a bit more and use 15 PCs for the downstream analysis to make sure we are not loosing any biological signal

se <- RunUMAP(se, reduction = "pca", dims = 1:30, verbose = FALSE)

For the purpose of this tutorial we are going to be using the author’s provided annotation

(dim_plt <- DimPlot(
    se,
    group.by = "Celltype") +
    scale_color_manual(values = pal))

Gene Signature Scoring

Here we define some gene signatures based on prior knowledge. We are setting gene signature that are characteristic for specific cell types to score for their activities in our dataset.

bcell <- c("MS4A1", "CD79A", "CD79B", "BANK1", "HLA-DQB1", "HLA-DQA1")
tcell <- c("CD3D", "CD3E", "TRAC", "TRBC1", "TRBC2", "CD4", "CD8A", "CD8B")
tnaive <- c(tcell, "IL7R", "CCR7", "TCF7", "LEF1", "SELL")
cd8cyto <- c(tcell, "GZMA", "GZMK", "NKG7", "CCL5")
mono <- c("FCGR3A", "CD14", "S100A9", "S100A8", "MS4A7")
nks <- c("NCR1", "NCR2", "NCR3", "FCGR3A", "GZMA", "GZMK", "NKG7", "CCL5")

We can see how there are some genes that are specific for each signature but others are shared between them. This is important to take into account when computing the gene signatures and interpreting their scores.

To help us compute these gene signatures we are going to use the R package decoupleR from Bioconductor. decoupleR is a great for carrying out these analysis since it is a framework that contains different statistical methods to compute these scores. Ultimately we will obtain a score for each signature for each cell.

decoupleR requires the gene signatures to be passed as a dataframe so we are going to convert our gene signature vectors into a unified dataframe. mor stands for Mode Of Regulation, at the moment since we don’t have a score of how important that gene is for that signature we are going to weight them all equally with a value of 1.

sig_ls <- list("B cells" = bcell, "T cells" = tcell, "Naive T cells" = tnaive,
               "CD8 Cytotoxic" = cd8cyto, "Monocytes" = mono, "NKs" = nks)

sig_df <- lapply(names(sig_ls), function(i) {
    data.frame(
        signature = i,
        gene = sig_ls[[i]],
        mor = 1 
    )
}) %>% bind_rows()

sig_df[1:10, ]
   signature     gene mor
1    B cells    MS4A1   1
2    B cells    CD79A   1
3    B cells    CD79B   1
4    B cells    BANK1   1
5    B cells HLA-DQB1   1
6    B cells HLA-DQA1   1
7    T cells     CD3D   1
8    T cells     CD3E   1
9    T cells     TRAC   1
10   T cells    TRBC1   1

ULM

“Univariate Linear Model (ULM) fits a linear model for each sample and regulator, where the observed molecular readouts in mat are the response variable and the regulator weights in net are the explanatory one. Target features with no associated weight are set to zero. The obtained t-value from the fitted model is the activity ulm of a given regulator.”

Moreover, a nice thing about ulm is that in a single statistic it provides the direction of activity (either + or -) and its significance (the magnitude of the score). Making the scores very easy to interpret!

So lets compute the signature scores for every cell in our dataset!

ulm_start <- Sys.time()
res <- decoupleR::run_ulm(
    mat = se@assays$RNA$data,
    network = sig_df,
    .source = "signature",
    .target = "gene",
    .mor = "mor")

glue::glue("Time to run ulm is {round(difftime(Sys.time(), ulm_start, units = 's'), 0)} seconds")
Time to run ulm is 27 seconds
# remove densified memory created by decoupleR
suppressMessages(gc(verbose = FALSE)) # gc stands for garbage collection
            used   (Mb) gc trigger    (Mb) limit (Mb)   max used    (Mb)
Ncells   5888651  314.5   10364122   553.6         NA   10364122   553.6
Vcells 691412275 5275.1 5465424749 41697.9      98304 5700883630 43494.3

We can see how every cells has a score for every signature!

# Looking at the first 10 entries
res
# A tibble: 349,194 × 5
   statistic source  condition            score  p_value
   <chr>     <chr>   <chr>                <dbl>    <dbl>
 1 ulm       B cells AAACCCAAGAATGTTG-12  0.648 5.17e- 1
 2 ulm       B cells AAACCCAAGCATTGTC-19 -0.571 5.68e- 1
 3 ulm       B cells AAACCCAAGCTACGTT-6  -0.550 5.82e- 1
 4 ulm       B cells AAACCCAAGGCCGCTT-3   0.887 3.75e- 1
 5 ulm       B cells AAACCCAAGGCCTGCT-12  1.00  3.15e- 1
 6 ulm       B cells AAACCCAAGGGCAATC-1  16.4   4.37e-60
 7 ulm       B cells AAACCCAAGTAAGGGA-18  8.41  4.35e-17
 8 ulm       B cells AAACCCAAGTGTAGAT-5   0.730 4.65e- 1
 9 ulm       B cells AAACCCACAAGAGCTG-17  0.241 8.10e- 1
10 ulm       B cells AAACCCACACAATGCT-5   0.460 6.46e- 1
# ℹ 349,184 more rows
# Check the cell type for cell AAACCCAAGGGCAATC-1
se@meta.data[c("AAACCCAAGGGCAATC-1", "AAACCCAAGGCCTGCT-12"), "Celltype", drop = FALSE]
                        Celltype
AAACCCAAGGGCAATC-1  B cell, IgG+
AAACCCAAGGCCTGCT-12      NK cell
ct_b <- "AAACCCAAGGGCAATC-1"
ct_nk <- "AAACCCAAGGCCTGCT-12"

# Looking at all the scores for one specific cell
res %>% filter(condition %in% c(ct_b, ct_nk))
# A tibble: 12 × 5
   statistic source        condition            score  p_value
   <chr>     <chr>         <chr>                <dbl>    <dbl>
 1 ulm       B cells       AAACCCAAGGCCTGCT-12  1.00  3.15e- 1
 2 ulm       B cells       AAACCCAAGGGCAATC-1  16.4   4.37e-60
 3 ulm       T cells       AAACCCAAGGCCTGCT-12  0.722 4.70e- 1
 4 ulm       T cells       AAACCCAAGGGCAATC-1  -0.175 8.61e- 1
 5 ulm       Naive T cells AAACCCAAGGCCTGCT-12  0.278 7.81e- 1
 6 ulm       Naive T cells AAACCCAAGGGCAATC-1   1.01  3.10e- 1
 7 ulm       CD8 Cytotoxic AAACCCAAGGCCTGCT-12  3.61  3.13e- 4
 8 ulm       CD8 Cytotoxic AAACCCAAGGGCAATC-1  -0.433 6.65e- 1
 9 ulm       Monocytes     AAACCCAAGGCCTGCT-12  1.19  2.33e- 1
10 ulm       Monocytes     AAACCCAAGGGCAATC-1   1.41  1.60e- 1
11 ulm       NKs           AAACCCAAGGCCTGCT-12  4.71  2.49e- 6
12 ulm       NKs           AAACCCAAGGGCAATC-1  -0.711 4.77e- 1

We can see how this cell had been annotated as a B cell and when we look at the scores the B cell signature has the highest score with a significant p value.

Visualization

We can directly add the ulm scores to an assay in our object and visualize the results

se[['signatureulm']] <- res %>%
  pivot_wider(
      id_cols = 'source',
      names_from = 'condition',
      values_from = 'score') %>%
  column_to_rownames('source') %>%
  Seurat::CreateAssayObject(.)

# Change assay
DefaultAssay(object = se) <- "signatureulm"

# Scale the data for comparison across signatures 
se <- ScaleData(se)
se@assays$signatureulm@data <- se@assays$signatureulm@scale.data
UMAP visualization

Plot all the gene signatures one after the other

plt <- FeaturePlot(
    se,
    features = rownames(se@assays$signatureulm),
    ncol = 3) &
    scale_color_viridis_c(option = "magma")

plt + dim_plt

Heatmap by groups

We can also visualize the gene signature scores for each individual cell using a heatmap

DoHeatmap(
    se,
    features = rownames(se@assays$signatureulm),
    group.by = "Celltype",
    slot = "data",
    group.colors = RColorBrewer::brewer.pal(n = 7, name = "Dark2")) +
    scale_fill_viridis_c(option = "viridis") +
    labs(x = "Each column is a cell", y = "Signatures")

From the plot above we can see how we have very distinct populations in our datasets. We can also look at it a bit less granular by looking at the mean activity score per cluster.

# Extract activities from object as a long dataframe
df <- t(as.matrix(se@assays$signatureulm@data)) %>%
  as.data.frame() %>%
  mutate(cluster = se$Celltype) %>%
  pivot_longer(cols = -cluster, names_to = "source", values_to = "score") %>%
  group_by(cluster, source) %>%
  summarise(mean = mean(score))

# Transform to wide matrix
top_acts_mat <- df %>%
  pivot_wider(id_cols = 'cluster', names_from = 'source',
              values_from = 'mean') %>%
  column_to_rownames('cluster') %>%
  as.matrix()
# Choose color palette
palette_length = 100
my_color = colorRampPalette(c("#00008B", "white","red"))(palette_length)

# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")
Note that the maximum scaled value is: 2.66, and the minimum is -0.87.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
               seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)

Heatmap for gene expression

To fully grasp which genes are driving each gene signature within each cell we want to visualize the gene expression of the genes involved in each gene signature for each cell. We can do so using the ComplexHeatmap package and a little bit of data processing. For ease here is a function you can incorporate in your analysis:

geneHM <- function(
        object,
        sig_df,
        sig_name,
        sig_assay,
        .source,
        .target,
        sig_slot = "data",
        expr_assay = "RNA",
        expr_slot = "data",
        grouping = NULL,
        grouping_color = NULL,
        expr_cols = viridisLite::magma(100)) {
    
    # Extract Gene Expression Matrix from Seurat Object
    gene_expr <- GetAssayData(object, assay = expr_assay, layer = expr_slot)
    
    # Subset the genes of the signature from the Gene Expression Matrix
    genes_of_interest <- sig_df[, .target][which(sig_df[, .source] %in% sig_name)]
    
    # Subset the genes intersecting between gene expression and genes in signature
    g_int <- intersect(rownames(gene_expr), genes_of_interest)
    
    if (length(g_int) < length(genes_of_interest)) {
        genes_excluded <- genes_of_interest[!genes_of_interest %in% rownames(gene_expr)]
        genes_excluded <- paste(genes_excluded, collapse = ", ")
        message(paste0(
            "Genes ", genes_excluded,
            " are in the gene signature but not in the expression matrix,",
            " therefore, they have been excluded."))
    }
    
    # Subset expression matrix to only genes of interest
    gene_expr <- gene_expr[g_int, ]
    
    # Extract the Scores of the Signature of interest
    sig_score <- GetAssayData(object, assay = sig_assay, layer = sig_slot)
    sig_vec <- sig_score[sig_name, ]
    anno <- data.frame(score = sig_vec)
    # Make sure they are in the right order
    anno <- anno[colnames(gene_expr), , drop = FALSE]
    
    # Add any metadata if specified
    if (!is.null(grouping)) {
        meta <- object@meta.data[, grouping, drop = FALSE]
        anno <- cbind(anno, meta[rownames(anno), , drop = FALSE])
    }
    
    if (any(is.infinite(c(anno$score))))
        stop("There are scores with Inf values, please address this outside of this function. It could be because the slot used is scale_data.")
    
    # Make list of color to paint the annotation columns
    if (!is.null(grouping) & !is.null(grouping_color)) {
        score <- circlize::colorRamp2(
            breaks = c(quantile(anno$score, 0.01), 0, quantile(anno$score, 0.99)),
            colors = c("blue", "white", "red"))
        color_ls <- append(grouping_color, score)
        names(color_ls)[length(color_ls)] <- "score"
        
    } else {
        color_ls <- list(
            score = circlize::colorRamp2
            (breaks = c(min(anno$score), 0, max(anno$score)),
                      colors = c("blue", "white", "red")),
            annot = clust_color)
    }
    
    # Set the order from most expressing to least expressing 
    ord <- rownames(anno[order(anno$score, decreasing = TRUE), ])
    # Add the score of the signature as annotation in the heatmap
    colAnn <- HeatmapAnnotation(
        df = anno[ord, , drop = FALSE],
        which = 'column',
        col = color_ls
        )
    
    # Visualize the Heatmap with the genes and signature
    ht <- Heatmap(
        as.matrix(gene_expr[, ord]),
        name = "Gene Expression",
        col = expr_cols,
        cluster_rows = TRUE,
        cluster_columns = TRUE,
        column_title = sig_name,
        column_names_gp = gpar(fontsize = 14),
        show_column_names = FALSE,
      top_annotation = colAnn)
    
    # Return ComplexHeatmap
    draw(ht)
}

# Visualize the heatmaps for all signatures
# tt <- lapply(unique(sig_df$signature), function(i) {
#     geneHM(
#         object = se,
#         sig_df = sig_df,
#         .source = "signature",
#         .target = "gene",
#         sig_name = i,
#         expr_slot = "data",
#         expr_assay = "RNA",
#         sig_assay = "signatureulm",
#         sig_slot = "data",
#         grouping = c("RNA_snn_res.0.15"),
#         grouping_color = list(RNA_snn_res.0.15 = clust_color))
# })

Here are some examples of how to interpret these gene signatures:

  1. In the Monocyte signature not all cells that have the same score have the same genes expressed. For example, wee can see how among the cells with high scores there are cells that express CD14 with a gradient switching to expression of FCGR3A. Therefore, classifying all of these cells as the same just because the gene signature scoring returns the same value would be a mistake. In this case, a likely scenario could be that the gene signature isn’t teasing out differences between classical, intermediate, and non-classical monocytes.
# Subset to 10% of the original dataset for speed and visualization purposes
se_sub <- se[, sample(colnames(se), 0.1 * ncol(se))]
geneHM(
    object = se_sub,
    sig_df = sig_df,
    .source = "signature",
    .target = "gene",
    sig_name = "Monocytes",
    expr_slot = "data",
    expr_assay = "RNA",
    sig_assay = "signatureulm",
    sig_slot = "data",
    grouping = c("Celltype"),
    grouping_color = list(Celltype = pal))

  1. When looking at the CD8 Cytotoxic compartment we also observe how NK cells have a high score despite not expressing CD3 genes. This can be due to the simulatenous high expression of NKG7 and CCL5 in NKs and CD8 T cells.
geneHM(
    object = se_sub,
    sig_df = sig_df,
    .source = "signature",
    .target = "gene",
    sig_name = "CD8 Cytotoxic",
    expr_slot = "data",
    expr_assay = "RNA",
    sig_assay = "signatureulm",
    sig_slot = "data",
    grouping = c("Celltype"),
    grouping_color = list(Celltype = pal))

  1. This is more of a dummy example but when assessing the T cell signature c("CD3D", "CD3E", "TRAC", "TRBC1", "TRBC2", "CD4", "CD8A", "CD8B") CD8 T cells have a higher score than CD4 T cells. This is due to there being two CD8 genes (A & B) as well as CD8B being expressed at higher levels than CD4. Therefore, we need to keep an eye on these things to better interpret the heterogeneity within our populations.
geneHM(
    object = se_sub,
    sig_df = sig_df,
    .source = "signature",
    .target = "gene",
    sig_name = "T cells",
    expr_slot = "data",
    expr_assay = "RNA",
    sig_assay = "signatureulm",
    sig_slot = "data",
    grouping = c("Celltype"),
    grouping_color = list(Celltype = pal))

In summary, when scoring gene signatures and looking at their activities it is important to not only look at the oveall score obtained for each cell but one also needs to dive deeper into which are the genes that are driving that signature!

Extra!

How does a univariate linear model work?

Lets start with a toy example. Imagine a very simple scenario where we have two very simple vectors where one is double the other. We can compute the linear model and also easily visualize the relationship between both vectors:

# Define vectors of interest
vec1 <- c(1, 2, 5)
vec2 <- c(2.1, 3.8, 9.7)

# Run the linear model
summary(lm(vec2 ~ vec1))

Call:
lm(formula = vec2 ~ vec1)

Residuals:
       1        2        3 
 0.09231 -0.12308  0.03077 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.09231    0.16853   0.548   0.6810  
vec1         1.91538    0.05329  35.940   0.0177 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1569 on 1 degrees of freedom
Multiple R-squared:  0.9992,    Adjusted R-squared:  0.9985 
F-statistic:  1292 on 1 and 1 DF,  p-value: 0.01771
# Visualize the data
(p <- ggplot(mapping = aes(x = vec1, y = vec2)) +
    geom_point() +
    geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
    coord_fixed() +
    xlim(0, 10) +
    ylim(0, 10) +
    theme_minimal())

# now we can add the slope of the line that best fits our data and the T value
p +
    stat_poly_line(formula = y ~ x, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

In the example above we see the linear relationship between both vectors and we get the slope and the T value:
- The slope indicates the what is the change in the response variable (vec2) given a 1 unit change in the predictor variable (vec1).

- The T statistic is the result of a T test. The T test assesses the significance of individual coefficients in our model. The T value indicates the number of standard errors the estimated coefficient is away from the null hypothesis (t = 0). Remember the T value is the \(\frac{coefficient}{standard~error}\).

Now lets look at a “real world” example, we want to score the B cell signature in one cell. First we are going to start by visualizing the relationship between the weights and the gene expression for 2 cells of interest, one is a B cell and the other is not.

We need to do a bit of data prep but bear with me!

# We have our gene expression matrix
mat <- se@assays$RNA$data

# We want to obtain a matrix with 1s and 0s indicating the weight each gene has for each signature
## Initialize mor_mat with all 0s
sources <- unique(sig_df$signature)
targets <- rownames(mat)
mor_mat <- matrix(0, ncol = length(sources), nrow = nrow(mat))
colnames(mor_mat) <- sources
rownames(mor_mat) <- targets
weights <- sig_df$mor

# Fill in the matrix with the weights in the right places
for (i in 1:nrow(sig_df)) {
    .source <- sig_df$signature[[i]]
    .target <- sig_df$gene[[i]]
    .weight <- weights[[i]]
    if (.target %in% targets) {
        mor_mat[[.target, .source]] <- .weight
    }
}
# labels for geom_text_repel
repel_text <- rownames(mat)
keep <- which(rownames(mat) %in% bcell)
# Set non-selected positions to NA
repel_text[-keep] <- NA

# Visualize the data
ggplot(mapping = aes(x = mat[, ct_b], y = mor_mat[, "B cells", drop = FALSE])) +
    geom_point() +
    ggrepel::geom_text_repel(aes(label = repel_text)) +
    geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
    labs(x = "Gene Expression", y = "Gene Weight", title = glue::glue("Cell {ct_b} - B cell")) +
    coord_fixed() +
    xlim(0, 10) +
    ylim(0, 5) +
    theme_minimal() +
    # now we can add the slope of the line that best fits our data and the T value
    stat_poly_line(formula = x ~ y, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

# Visualize the data
ggplot(mapping = aes(x = mat[, ct_nk], y = mor_mat[, "B cells", drop = FALSE])) +
    geom_point() +
    ggrepel::geom_text_repel(aes(label = repel_text)) +
    geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
    labs(x = "Gene Expression", y = "Gene Weight", title = glue::glue("Cell {ct_nk} - Not B cell")) +
    coord_fixed() +
    xlim(0, 7.5) +
    ylim(0, 2.5) +
    theme_minimal() +
    # now we can add the slope of the line that best fits our data and the T value
    stat_poly_line(formula = x ~ y, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

Next we are going to manually run the models for these two cells so that we can see that the results obtained from decoupleR make sense!

Check that the mor_mat has the weights in the right places

bcell
[1] "MS4A1"    "CD79A"    "CD79B"    "BANK1"    "HLA-DQB1" "HLA-DQA1"
mor_mat["MS4A1", ] # We can see how MS4A1 only has a weight in the B cell signature!
      B cells       T cells Naive T cells CD8 Cytotoxic     Monocytes           NKs 
            1             0             0             0             0             0 
mor_mat["CD3D", ] # We can see how CD3D has weights in all T cell signatures!
      B cells       T cells Naive T cells CD8 Cytotoxic     Monocytes           NKs 
            0             1             1             1             0             0 

Lets run the a linear model to score two cell for the B cell signature

mod1 <- lm(as.matrix(mat[, ct_b, drop = FALSE]) ~ mor_mat[, "B cells", drop = FALSE])
summary(mod1)

Call:
lm(formula = as.matrix(mat[, ct_b, drop = FALSE]) ~ mor_mat[, 
    "B cells", drop = FALSE])

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3411 -0.0814 -0.0814 -0.0814  6.1124 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        0.081443   0.001778   45.80   <2e-16 ***
mor_mat[, "B cells", drop = FALSE] 2.167981   0.132333   16.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3241 on 33232 degrees of freedom
Multiple R-squared:  0.008012,  Adjusted R-squared:  0.007982 
F-statistic: 268.4 on 1 and 33232 DF,  p-value: < 2.2e-16
2.167981 / 0.132333 # This equals the T value
[1] 16.38277
mod2 <- lm(as.matrix(mat[, ct_nk, drop = FALSE]) ~ mor_mat[, "B cells", drop = FALSE])
summary(mod2)

Call:
lm(formula = as.matrix(mat[, ct_nk, drop = FALSE]) ~ mor_mat[, 
    "B cells", drop = FALSE])

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2301 -0.0775 -0.0775 -0.0775  6.9463 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        0.077458   0.002042  37.928   <2e-16 ***
mor_mat[, "B cells", drop = FALSE] 0.152596   0.151992   1.004    0.315    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3723 on 33232 degrees of freedom
Multiple R-squared:  3.033e-05, Adjusted R-squared:  2.397e-07 
F-statistic: 1.008 on 1 and 33232 DF,  p-value: 0.3154
0.152596 / 0.151992 # This equals the T value
[1] 1.003974

We can see how mod1 has returned a high coefficient for cell TTTGCATGAGAGGC while mod2 has returned a low coefficient for cell AAGATGGAGATAAG. Moreover, when we look at the T value for the B cell signature we see how in mod1 it is 8.881 while for mod2 it is 0.697.

Lets check that the T values obtained manually actually match those returned by decoupleR

res %>% filter(source == "B cells" & condition %in% c(ct_b, ct_nk))
# A tibble: 2 × 5
  statistic source  condition           score  p_value
  <chr>     <chr>   <chr>               <dbl>    <dbl>
1 ulm       B cells AAACCCAAGGCCTGCT-12  1.00 3.15e- 1
2 ulm       B cells AAACCCAAGGGCAATC-1  16.4  4.37e-60

Effectively, they do!

In the above example we showed how to compute signature scores using ulm, if we take a closer look to the original decoupleR paper (badia-i-mompel 2022) we can see how in Supplementary Figure 3 ulm and mlm slightly outperform norm_wmean in terms of AUROC and AUPRC. Moreover, in the Bioconductor Vignettes they showcase the use of run_wmean instead of ulm. So… why use ulm instead of norm_wmean?


Run norm_wmean

In this section we are going to show how computing the normalized weighted mean with 1,000 permutations provides a very similar result to the ulm but takes much longer!.

Run decouple on our data using the wmean method. As mentioned in the details of the function: “WMEAN infers regulator activities by first multiplying each target feature by its associated weight which then are summed to an enrichment score wmean. Furthermore, permutations of random target features can be performed to obtain a null distribution that can be used to compute a z-score norm_wmean, or a corrected estimate corr_wmean by multiplying wmean by the minus log10 of the obtained empirical p-value.”.

wmean_start <- Sys.time()
res_wmean <- decoupleR::run_wmean(
    mat = se@assays$RNA$data,
    network = sig_df,
    .source = "signature",
    .target = "gene",
    .mor = "mor",
    times = 1000)

glue::glue("Time to run norm_wmean with 1000 iterations is {round(difftime(Sys.time(), wmean_start, units = 's'), 0)} seconds")

res_wmean

We obtain a long format tibble containing:

  • statistic - Indicating which method is associated with each score

    • wmean: multiplying each target feature by its associated weight which then are summed to an enrichment score

    • norm_wmean: permutations of random target features can be performed to obtain a null distribution that can be used to compute a z-score norm_wmean

    • corr_wmean: corrected estimate by multiplying wmean by the minus log10 of the obtained empirical p-value

  • source (aka - signature name)

  • condition - cell barcode

  • score - the signature score, the inferred biological activity.

  • p_value - P value obtained from permutations

Compare ulm with norm_wmean scores

res2 <- res %>%
    dplyr::select(statistic, source, score, condition)
colnames(res2) <- glue::glue("{colnames(res2)}_ulm")

res_wmean2 <- res_wmean %>%
    dplyr::filter(statistic == "norm_wmean") %>%
    dplyr::select(statistic, source, score, condition)
colnames(res_wmean2) <- glue::glue("{colnames(res_wmean2)}_wmean")

res2 %>%
    left_join(
        res_wmean2,
        by = c("condition_ulm" = "condition_wmean", "source_ulm" = "source_wmean")) %>%
    ggplot(aes(x = score_ulm, y = score_wmean)) +
    geom_point() +
    facet_wrap(~source_ulm, scales = "free") +
    stat_smooth(method = "lm", formula = y ~ x, geom = "smooth") +
    ggpubr::stat_cor(method = "pearson") +
    labs(x = "ulm", y = "norm_wmean") +
    theme_minimal()

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggpmisc_0.5.4-1       ggpp_0.5.4            colorBlindness_0.1.9  ComplexHeatmap_2.16.0 decoupleR_2.9.1       lubridate_1.9.2       forcats_1.0.0         stringr_1.5.1         dplyr_1.1.4           purrr_1.0.2           readr_2.1.4           tidyr_1.3.1           tibble_3.2.1          ggplot2_3.5.0         tidyverse_2.0.0       Seurat_5.0.2          SeuratObject_5.0.1    sp_2.1-3             

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22       splines_4.3.1          later_1.3.2            polyclip_1.10-6        confintr_1.0.2         fastDummies_1.7.3      lifecycle_1.0.4        doParallel_1.0.17      globals_0.16.2         lattice_0.21-8         vroom_1.6.3            MASS_7.3-60            magrittr_2.0.3         plotly_4.10.4          rmarkdown_2.25         yaml_2.3.8             remotes_2.4.2.1        httpuv_1.6.14          sctransform_0.4.1      spam_2.10-0            spatstat.sparse_3.0-3  reticulate_1.35.0.9000 cowplot_1.1.3          pbapply_1.7-2          RColorBrewer_1.1-3     abind_1.4-5            Rtsne_0.17             BiocGenerics_0.46.0    circlize_0.4.15        IRanges_2.34.1         S4Vectors_0.38.1       ggrepel_0.9.5          irlba_2.3.5.1          listenv_0.9.1          spatstat.utils_3.0-4   MatrixModels_0.5-2     goftest_1.2-3          RSpectra_0.16-1        spatstat.random_3.2-2  fitdistrplus_1.1-11    parallelly_1.37.0      leiden_0.4.3.1         codetools_0.2-19       tidyselect_1.2.0       shape_1.4.6            farver_2.1.1           matrixStats_1.2.0      stats4_4.3.1           spatstat.explore_3.2-6 jsonlite_1.8.8         GetoptLong_1.0.5      
 [52] ellipsis_0.3.2         progressr_0.14.0       ggridges_0.5.6         survival_3.5-7         iterators_1.0.14       foreach_1.5.2          tools_4.3.1            ica_1.0-3              Rcpp_1.0.12            glue_1.7.0             gridExtra_2.3          xfun_0.42              withr_3.0.0            BiocManager_1.30.22    fastmap_1.1.1          fansi_1.0.6            SparseM_1.81           digest_0.6.34          timechange_0.2.0       R6_2.5.1               mime_0.12              gridGraphics_0.5-1     colorspace_2.1-0       Cairo_1.6-1            scattermore_1.2        tensor_1.5             spatstat.data_3.0-4    utf8_1.2.4             generics_0.1.3         data.table_1.15.0      httr_1.4.7             htmlwidgets_1.6.4      uwot_0.1.16            pkgconfig_2.0.3        gtable_0.3.4           lmtest_0.9-40          htmltools_0.5.7        dotCall64_1.1-1        clue_0.3-64            scales_1.3.0           png_0.1-8              knitr_1.45             rstudioapi_0.15.0      tzdb_0.4.0             reshape2_1.4.4         rjson_0.2.21           nlme_3.1-163           zoo_1.8-12             GlobalOptions_0.1.2    KernSmooth_2.23-22     parallel_4.3.1        
[103] miniUI_0.1.1.1         pillar_1.9.0           vctrs_0.6.5            RANN_2.6.1             promises_1.2.1         xtable_1.8-4           cluster_2.1.4          evaluate_0.23          magick_2.7.5           cli_3.6.2              compiler_4.3.1         rlang_1.1.3            crayon_1.5.2           future.apply_1.11.1    labeling_0.4.3         plyr_1.8.9             stringi_1.8.3          viridisLite_0.4.2      deldir_2.0-2           munsell_0.5.0          lazyeval_0.2.2         spatstat.geom_3.2-8    quantreg_5.97          Matrix_1.6-5           RcppHNSW_0.6.0         hms_1.1.3              patchwork_1.2.0        bit64_4.0.5            future_1.33.1          shiny_1.8.0            ROCR_1.0-11            igraph_2.0.2           bit_4.0.5              polynom_1.4-1         

References

Badia-i-Mompel, Pau, Jesús Vélez Santiago, Jana Braunger, Celina Geiss, Daniel Dimitrov, Sophia Müller-Dott, Petr Taus, et al. 2022. “decoupleR: Ensemble of Computational Methods to Infer Biological Activities from Omics Data.” Edited by Marieke Lydia Kuijjer. Bioinformatics Advances 2 (1). https://doi.org/10.1093/bioadv/vbac016.
Mallik, Saurav, and Zhongming Zhao. 2018. “Identification of Gene Signatures from RNA-Seq Data Using Pareto-Optimal Cluster Algorithm.” BMC Systems Biology 12 (S8). https://doi.org/10.1186/s12918-018-0650-2.